FINAL REPORT

Problem statement:

A house's value is about more than location and square footage. Like the features that make up a person, an informed party wants to know every aspect that contributes to a house's value. Suppose, for example, that you want to sell a house and do not know what price to ask — it cannot be too low or too high. To estimate the price, you would typically look for similar properties in your neighbourhood and, based on the gathered data, assess your own house's value.

1. Summary of problem statement, data and findings

The proposed project is about determining the price of a house based on all the features in the available dataset, not only location and square footage. The problem statement also makes clear that the value of the house is assessed not only from the buyer's perspective but also from the seller's, using the given dataset to decide the correct price to tag on the house. To get a clear understanding of the dataset, we began our work by analyzing it in an Excel workbook using filters, which gave some interesting insights:

  • The room_bed attribute contains a value of 33, which, when compared with the total area of that record, suggests an outlier.
  • The room_bath attribute (number of bathrooms per bedroom) has non-zero values in rows where room_bed is 0, which indicates erroneous values in the dataset.
  • The attribute lot_measure15 is only a part of the total area, yet in a few cases its value exceeds total_area, which is not possible.
  • The attribute ceil (total floors/levels in the house) has some decimal values; a count of floors should not be fractional.
  • The attribute yr_renovated has many values of 0, yet lot_measure15 exceeds lot_measure in some of those cases. This suggests that some of the zeros in yr_renovated are actually missing values.
  • The correlation between attributes is not apparent from the Excel workbook; it can be examined in the Python notebook.

  • These inferences made it easier to begin our analysis of the dataset.

  • Note: since we have one target variable, price, and all other attributes are independent variables, we will train our model on the independent variables. This problem therefore falls under supervised learning.
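The Excel filter checks above can be reproduced with a few pandas filters. The following is a minimal sketch on a hypothetical mini-sample — the column names match innercity.csv, but the rows are made up purely for illustration:

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as innercity.csv,
# constructed here only to make the checks runnable.
df = pd.DataFrame({
    "room_bed":      [33, 0, 3, 4],
    "room_bath":     [1.75, 2.5, 0.0, 2.0],
    "lot_measure15": [7553, 8800, 99999, 4348],
    "total_area":    [16477, 10050, 6694, 6129],
    "yr_renovated":  [0, 0, 1987, 0],
})

# 1. Implausible bedroom counts (e.g. the 33-bedroom record).
suspect_beds = df[df["room_bed"] > 10]

# 2. Bathrooms reported while bedrooms are zero.
inconsistent = df[(df["room_bed"] == 0) & (df["room_bath"] > 0)]

# 3. lot_measure15 larger than total_area, which should be impossible.
impossible_lot = df[df["lot_measure15"] > df["total_area"]]

print(len(suspect_beds), len(inconsistent), len(impossible_lot))  # 1 1 1
```

On the real dataset, the same boolean masks give the exact row counts affected by each issue, which is useful when deciding whether to drop or impute.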

Summary of the Approach to EDA and Pre-processing

UNIVARIATE ANALYSIS

In [36]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

from sklearn.decomposition import PCA
from scipy.stats import zscore
In [37]:
inn_df = pd.read_csv("innercity.csv")  
In [3]:
inn_df
Out[3]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... basement yr_built yr_renovated zipcode lat long living_measure15 lot_measure15 furnished total_area
0 3034200666 20141107T000000 808100 4 3.25 3020 13457 1.0 0 0 ... 0 1956 0 98133 47.7174 -122.336 2120 7553 1 16477
1 8731981640 20141204T000000 277500 4 2.50 2550 7500 1.0 0 0 ... 800 1976 0 98023 47.3165 -122.386 2260 8800 0 10050
2 5104530220 20150420T000000 404000 3 2.50 2370 4324 2.0 0 0 ... 0 2006 0 98038 47.3515 -121.999 2370 4348 0 6694
3 6145600285 20140529T000000 300000 2 1.00 820 3844 1.0 0 0 ... 0 1916 0 98133 47.7049 -122.349 1520 3844 0 4664
4 8924100111 20150424T000000 699000 2 1.50 1400 4050 1.0 0 0 ... 0 1954 0 98115 47.6768 -122.269 1900 5940 0 5450
5 5525400430 20140715T000000 585000 3 2.50 2050 11690 2.0 0 0 ... 0 1989 0 98059 47.5279 -122.161 2410 10172 1 13740
6 2419600075 20141201T000000 465000 3 1.75 1480 6360 1.0 0 0 ... 0 1954 0 98133 47.7311 -122.353 1480 6360 0 7840
7 114101161 20140829T000000 480000 3 1.50 2100 67269 1.0 0 0 ... 880 1949 0 98028 47.7592 -122.230 1610 15999 0 69369
8 7011201550 20140707T000000 780000 4 2.00 2600 4800 1.0 0 2 ... 1200 1953 0 98119 47.6370 -122.371 2050 3505 0 7400
9 7203000640 20140918T000000 215000 4 1.00 1130 7400 1.0 0 0 ... 0 1969 0 98003 47.3437 -122.316 1540 7379 0 8530
10 7518503685 20141009T000000 402000 2 1.00 710 5100 1.0 0 0 ... 0 1905 0 98117 47.6765 -122.381 1530 5100 0 5810
11 7300400150 20141027T000000 299000 4 2.50 2350 6958 2.0 0 0 ... 0 1998 0 98092 47.3321 -122.172 2480 6395 1 9308
12 2215800050 20150415T000000 785000 4 2.50 3440 56192 2.0 0 0 ... 0 1994 0 98053 47.6969 -122.046 3150 44431 1 59632
13 7443000480 20150507T000000 865000 4 2.00 2750 5527 2.0 0 0 ... 620 1901 1987 98119 47.6513 -122.368 1290 1764 0 8277
14 5072100095 20141117T000000 554000 5 2.50 3440 12900 1.0 0 2 ... 1720 1958 0 98166 47.4426 -122.342 2100 10751 0 16340
15 1387301730 20150202T000000 361000 3 1.50 1200 7236 1.0 0 0 ... 0 1975 0 98011 47.7390 -122.194 1680 7800 0 8436
16 1310430130 20141009T000000 459000 4 2.75 2790 6600 2.0 0 0 ... 0 2000 0 98058 47.4362 -122.109 2900 6752 1 9390
17 3352400351 20141121T000000 200000 3 1.00 1480 5600 1.0 0 0 ... 540 1947 0 98178 47.5045 -122.270 1350 11100 0 7080
18 3678900110 20140610T000000 403000 2 1.00 1100 3598 1.0 0 0 ... 0 1926 0 98144 47.5738 -122.313 1240 3598 0 4698
19 2474400250 20140630T000000 327500 3 2.25 2310 7200 2.0 0 0 ... 0 1990 0 98031 47.4051 -122.193 1960 7201 0 9510
20 8820900029 20140610T000000 700000 5 2.75 3100 9825 2.0 0 2 ... 0 1950 1982 98125 47.7188 -122.281 2120 8400 0 12925
21 263000050 20141031T000000 730000 3 2.50 2160 8809 1.0 0 0 ... 620 2014 0 98103 47.6994 -122.349 930 5420 1 10969
22 9406500350 20141229T000000 207000 2 1.50 1068 1158 2.0 0 0 ... 0 1990 0 98028 47.7530 -122.244 1078 1278 0 2226
23 9533100145 20150205T000000 750000 3 1.00 1120 8549 1.0 0 0 ... 0 1952 0 98004 47.6294 -122.205 1440 8640 0 9669
24 5694500105 20141204T000000 595000 2 2.00 1510 4000 1.0 0 0 ... 500 1900 0 98103 47.6582 -122.345 1920 4000 0 5510
25 3291800710 20141120T000000 338000 4 3.00 2090 7500 1.0 0 0 ... 720 1986 0 98056 47.4888 -122.182 1810 7650 0 9590
26 9126100815 20141217T000000 500000 3 2.00 1560 1156 3.0 0 0 ... 0 2014 0 98122 47.6050 -122.304 1560 1728 0 2716
27 3416600800 20150209T000000 834000 4 2.50 2370 4000 1.5 0 2 ... 390 1928 0 98144 47.6010 -122.294 2440 5750 0 6370
28 7855000460 20141007T000000 1450000 3 2.75 3940 9671 1.0 0 4 ... 1800 1967 0 98006 47.5654 -122.158 3390 9360 1 13611
29 6204410330 20141020T000000 432000 4 1.75 2410 8400 1.0 0 0 ... 810 1978 0 98011 47.7341 -122.200 1850 8400 0 10810
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21583 1074100110 20140525T000000 355300 3 2.50 1620 7410 1.0 0 0 ... 0 1955 0 98133 47.7708 -122.335 1450 8121 0 9030
21584 1626079066 20140806T000000 290000 2 1.00 1120 217800 1.0 0 0 ... 0 1976 0 98019 47.7378 -121.912 1480 217800 0 218920
21585 2738600140 20140502T000000 499950 4 2.50 2860 3345 2.0 0 0 ... 670 2004 0 98072 47.7735 -122.158 2860 3596 0 6205
21586 4068300280 20140708T000000 255000 3 1.75 1550 9720 1.0 0 0 ... 500 1976 0 98010 47.3433 -122.037 1550 9750 0 11270
21587 1862400057 20150304T000000 320000 2 1.00 820 5400 1.0 0 0 ... 0 1940 0 98117 47.6976 -122.375 1370 5632 0 6220
21588 9352901085 20150204T000000 256000 3 1.00 1290 4720 1.0 0 0 ... 500 1948 0 98106 47.5186 -122.358 1110 4720 0 6010
21589 8856940060 20150227T000000 374950 4 2.75 2730 4683 2.0 0 0 ... 0 2005 0 98038 47.3608 -122.043 2230 4924 0 7413
21590 1526079026 20140813T000000 487500 5 3.50 3530 218472 2.0 0 0 ... 1150 1999 0 98019 47.7309 -121.905 2110 211404 0 222002
21591 1604600227 20150328T000000 441000 2 1.00 1150 3000 1.0 0 0 ... 370 1915 0 98118 47.5624 -122.291 1150 5000 0 4150
21592 7203150330 20140717T000000 669000 4 2.50 2470 4945 2.0 0 0 ... 0 2012 0 98053 47.6898 -122.015 2510 4988 0 7415
21593 8820903555 20150105T000000 467500 3 1.00 1830 6453 1.0 0 0 ... 0 1956 0 98125 47.7139 -122.288 1670 8012 0 8283
21594 579000595 20140906T000000 724000 2 1.00 1560 5000 1.5 0 1 ... 0 1942 0 98117 47.7006 -122.386 2620 5400 0 6560
21595 2734100835 20150303T000000 90000 1 1.00 780 4000 1.0 0 0 ... 0 1905 0 98108 47.5424 -122.321 1150 4000 0 4780
21596 2698200210 20140908T000000 274000 3 1.75 1440 7198 1.0 0 0 ... 450 1981 0 98055 47.4333 -122.194 1550 7156 0 8638
21597 7821200375 20150126T000000 432000 2 1.00 960 3235 1.0 0 0 ... 0 1916 0 98103 47.6610 -122.344 1290 2069 0 4195
21598 3904100089 20140801T000000 190000 3 1.75 1350 7370 1.0 0 0 ... 0 1912 0 98118 47.5336 -122.278 1440 6000 0 8720
21599 2589300065 20140916T000000 329900 3 1.75 1670 5209 1.5 0 0 ... 0 1908 0 98118 47.5362 -122.271 1990 4960 0 6879
21600 1446400715 20150422T000000 280000 2 1.00 1310 6600 1.0 0 0 ... 0 1942 0 98168 47.4834 -122.332 1240 6600 0 7910
21601 6752600130 20150413T000000 351000 4 2.50 2370 7274 2.0 0 0 ... 0 1997 0 98031 47.3982 -122.171 2090 7656 0 9644
21602 9433000480 20140922T000000 799950 4 3.50 3030 5494 3.0 0 0 ... 0 2014 0 98052 47.7103 -122.109 2910 5314 1 8524
21603 7974700112 20140714T000000 650000 4 2.50 2530 6500 1.5 0 0 ... 810 1975 0 98115 47.6737 -122.284 2150 5280 0 9030
21604 2115720130 20140821T000000 289950 3 2.50 2070 5013 2.0 0 0 ... 0 1987 0 98023 47.3202 -122.395 1670 5013 0 7083
21605 1727500340 20140614T000000 397500 3 2.00 1510 6710 1.0 0 0 ... 440 1972 0 98034 47.7193 -122.216 1660 6600 0 8220
21606 1517900100 20141021T000000 499000 4 2.50 2680 10590 2.0 0 0 ... 0 2004 0 98019 47.7377 -121.970 2330 5566 0 13270
21607 7942601435 20150324T000000 835000 6 2.00 3560 5120 2.5 0 2 ... 0 1900 0 98122 47.6056 -122.311 2130 5120 1 8680
21608 5137800030 20140701T000000 300000 4 2.50 2303 3826 2.0 0 0 ... 0 2006 0 98092 47.3258 -122.165 2516 4500 0 6129
21609 8562890910 20140619T000000 320000 4 2.50 3490 5000 2.0 0 0 ... 0 2003 0 98042 47.3772 -122.127 2910 5025 0 8490
21610 1442880160 20140627T000000 483453 4 2.75 2790 5527 2.0 0 0 ... 0 2014 0 98045 47.4827 -121.773 2620 5509 0 8317
21611 622100130 20140917T000000 365000 2 2.00 1440 15000 1.0 0 0 ... 0 1985 0 98072 47.7648 -122.159 1780 15000 0 16440
21612 6413600276 20150324T000000 354950 3 1.00 970 5922 1.5 0 0 ... 0 1949 0 98125 47.7190 -122.321 1730 6128 0 6892

21613 rows × 23 columns

In [4]:
inn_df.describe()
Out[4]:
cid price room_bed room_bath living_measure lot_measure ceil coast sight condition ... basement yr_built yr_renovated zipcode lat long living_measure15 lot_measure15 furnished total_area
count 2.161300e+04 2.161300e+04 21613.000000 21613.000000 21613.000000 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000 ... 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 2.161300e+04
mean 4.580302e+09 5.401822e+05 3.370842 2.114757 2079.899736 1.510697e+04 1.494309 0.007542 0.234303 3.409430 ... 291.509045 1971.005136 84.402258 98077.939805 47.560053 -122.213896 1986.552492 12768.455652 0.196687 1.718687e+04
std 2.876566e+09 3.673622e+05 0.930062 0.770163 918.440897 4.142051e+04 0.539989 0.086517 0.766318 0.650743 ... 442.575043 29.373411 401.679240 53.505026 0.138564 0.140828 685.391304 27304.179631 0.397503 4.158908e+04
min 1.000102e+06 7.500000e+04 0.000000 0.000000 290.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 ... 0.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000 0.000000 1.423000e+03
25% 2.123049e+09 3.219500e+05 3.000000 1.750000 1427.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 ... 0.000000 1951.000000 0.000000 98033.000000 47.471000 -122.328000 1490.000000 5100.000000 0.000000 7.035000e+03
50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 ... 0.000000 1975.000000 0.000000 98065.000000 47.571800 -122.230000 1840.000000 7620.000000 0.000000 9.575000e+03
75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 1.068800e+04 2.000000 0.000000 0.000000 4.000000 ... 560.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000 0.000000 1.300000e+04
max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 ... 4820.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000 1.000000 1.652659e+06

8 rows × 22 columns

In [5]:
inn_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 23 columns):
cid                 21613 non-null int64
dayhours            21613 non-null object
price               21613 non-null int64
room_bed            21613 non-null int64
room_bath           21613 non-null float64
living_measure      21613 non-null int64
lot_measure         21613 non-null int64
ceil                21613 non-null float64
coast               21613 non-null int64
sight               21613 non-null int64
condition           21613 non-null int64
quality             21613 non-null int64
ceil_measure        21613 non-null int64
basement            21613 non-null int64
yr_built            21613 non-null int64
yr_renovated        21613 non-null int64
zipcode             21613 non-null int64
lat                 21613 non-null float64
long                21613 non-null float64
living_measure15    21613 non-null int64
lot_measure15       21613 non-null int64
furnished           21613 non-null int64
total_area          21613 non-null int64
dtypes: float64(4), int64(18), object(1)
memory usage: 3.8+ MB
  • From the dataset we can clearly see that the attributes have data types int, float, and object.
  • The attribute dayhours is of type object and records the date the house was sold.
  • Before starting the univariate analysis we can drop the cid and dayhours columns, since they will have no effect on the target column (i.e., the price of the house).
In [3]:
inn_df = inn_df.drop(['cid'], axis=1)
In [4]:
inn_df = inn_df.drop(['dayhours'], axis=1)

Univariate Analysis

The univariate analysis can be started by using different basic plots.

In [5]:
inn_df1=inn_df
from scipy.stats import zscore
inn_df1 =inn_df.apply(zscore)
inn_df1.head()
Out[5]:
price room_bed room_bath living_measure lot_measure ceil coast sight condition quality ... basement yr_built yr_renovated zipcode lat long living_measure15 lot_measure15 furnished total_area
0 0.729318 0.676485 1.474063 1.023606 -0.039835 -0.915427 -0.087173 -0.305759 2.444294 1.142667 ... -0.658681 -0.510853 -0.210128 1.029090 1.135587 -0.867059 0.194707 -0.191018 2.020944 -0.017069
1 -0.715066 0.676485 0.500221 0.511858 -0.183656 -0.915427 -0.087173 -0.305759 -0.629187 0.291916 ... 1.148964 0.170051 -0.210128 -1.026840 -1.757734 -1.222109 0.398975 -0.145346 -0.494818 -0.171608
2 -0.370711 -0.398737 0.500221 0.315869 -0.260335 0.936506 -0.087173 -0.305759 -0.629187 0.291916 ... -0.658681 1.191407 -0.210128 -0.746486 -1.505137 1.525981 0.559471 -0.308402 -0.494818 -0.252304
3 -0.653817 -1.473959 -1.447464 -1.371813 -0.271924 -0.915427 -0.087173 -0.305759 0.907554 -1.409587 ... -0.658681 -1.872660 -0.210128 1.029090 1.045374 -0.959372 -0.680725 -0.326861 -0.494818 -0.301116
4 0.432329 -1.473959 -0.798235 -0.740293 -0.266950 -0.915427 -0.087173 -0.305759 0.907554 0.291916 ... -0.658681 -0.578943 -0.210128 0.692665 0.842574 -0.391291 -0.126285 -0.250094 -0.494818 -0.282217

5 rows × 21 columns

In [9]:
plt.figure(figsize=(10, 5))
sns.distplot(inn_df['price'], color='orange')
C:\Users\Gowtham\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e5ac94e0>

*The target column above is skewed to the right because of outliers.

*To check whether it is normally distributed, let us use the skewness and kurtosis tests.

In [10]:
from scipy.stats import skew
from scipy.stats import kurtosis
skew(inn_df['price'])
Out[10]:
4.021436449851422
In [11]:
kurtosis(inn_df['price'])
Out[11]:
34.514180829647714
  • From both these values it is very clear that the price column is skewed and not normally distributed.
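Skewness this strong is often reduced with a log transform before modelling. Below is a hedged sketch on synthetic log-normal data standing in for inn_df['price'] (the real column is not reproduced here):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (log-normal), a stand-in for inn_df['price'].
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=13, sigma=0.5, size=21613)

raw_skew = skew(prices)
log_skew = skew(np.log1p(prices))  # log1p stays finite even if a value were 0

print(raw_skew > 1)         # True: strongly right-skewed before the transform
print(abs(log_skew) < 0.5)  # True: roughly symmetric afterwards
```

Whether to actually transform the target depends on the model chosen later; linear models tend to benefit, tree-based models less so.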
In [12]:
plt.figure(figsize=(14, 5))
counts, bin_edges = np.histogram(inn_df['price'], bins=10, density=True)
plt.xlabel('price')
pdf= counts/(sum(counts))
print("pdf=", pdf);
print("bin_edges=", bin_edges);
cdf = np.cumsum(pdf)
print("cdf=", cdf);
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:],cdf);
pdf= [8.80072179e-01 9.91070189e-02 1.48984408e-02 4.16416046e-03
 1.20297969e-03 1.85073798e-04 1.85073798e-04 4.62684495e-05
 4.62684495e-05 9.25368991e-05]
bin_edges= [  75000.  837500. 1600000. 2362500. 3125000. 3887500. 4650000. 5412500.
 6175000. 6937500. 7700000.]
cdf= [0.88007218 0.9791792  0.99407764 0.9982418  0.99944478 0.99962985
 0.99981493 0.99986119 0.99990746 1.        ]

*The orange plot shows the CDF. From the plot it is very clear that around 90% of the houses are priced below roughly 1,000,000, so the price of a house purchase will most likely fall at or below that figure.
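The same ~90% figure can be read off directly with empirical quantiles instead of the CDF plot. A sketch on a synthetic price sample (an assumption — the real column is inn_df['price']):

```python
import numpy as np

# Synthetic price sample, standing in for inn_df['price'].
rng = np.random.default_rng(1)
prices = rng.lognormal(mean=13, sigma=0.5, size=21613)

# Empirical CDF at a threshold: fraction of houses priced at or below it.
threshold = np.quantile(prices, 0.90)    # price below which ~90% of houses fall
frac_below = (prices <= threshold).mean()
print(round(frac_below, 2))  # prints 0.9
```

On the real data, `np.quantile(inn_df['price'], 0.90)` would give the actual price cut-off corresponding to that 90% mass.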

In [13]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['room_bed'], color='orange')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e6a9f668>

*From the plot we can see several clusters with uneven peaks, which suggests the distribution is not normal. This may be due to outliers or missing values in the data.

*When the data was analyzed using filters in Excel we saw some zeros, but relating them to other features suggests those zeros may be missing values.

*Apart from that we can see an outlier which, when related with other features, seems to be an erroneous value.

In [14]:
plt.figure(figsize=(14, 5))
sns.distplot(inn_df['room_bath'], color='orange')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e69abf28>

*room_bath denotes the number of bathrooms per bedroom, so it depends on the number of bedrooms. In a few cases the number of bathrooms is zero while room_bed has a positive value, which is not possible. So zero denotes a missing value, as mentioned above.

In [15]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['living_measure'], color='orange')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e6be3780>

*living_measure denotes the total constructed area. Its distribution plot is also skewed to the right.

In [16]:
skew(inn_df['living_measure'])
Out[16]:
1.4714532949510906
In [17]:
kurtosis(inn_df['living_measure'])
Out[17]:
5.241602521613769
In [18]:
plt.figure(figsize=(6, 5))
sns.distplot(inn_df['lot_measure'], color='orange')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e6c25438>

*lot_measure is the square footage of the lot, which is heavily right-skewed.

In [19]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['ceil'], color='orange')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e6d00f28>

ceil denotes the total number of floors in the house. We can clearly see that most houses have 1 or 2 floors, but a few have 1.5, 2.5, or 3.5 floors, which is puzzling: either the data is wrong, or these houses have a partly constructed level — but how can the number of floors be a half?

In [20]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['coast'], color='orange')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e6f1ec50>

*Only a few houses are near the coast.

In [21]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['sight'], color='orange')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e6fcc438>

*From this it is very clear that most houses were never viewed, and the maximum number of times a house was viewed is 4.

In [22]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['condition'], color='orange')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e70a9c50>

condition: how good the overall condition is, with every house rated on a scale of 1 to 5. It is very clear that most houses are rated 3, some are rated 4 or 5, and very few are rated low.

In [23]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['quality'], color='orange')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e71f47b8>

*quality: grade given to the housing unit on a 1–13 scale, with 13 being the best. Most housing units have a grade of 7.

In [24]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['ceil_measure'], color='orange')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e68f1358>

*ceil_measure: square footage of the house excluding the basement. Its distribution is slightly right-skewed.

In [25]:
skew(inn_df['ceil_measure'])
Out[25]:
1.4465640690628738
In [26]:
kurtosis(inn_df['ceil_measure'])
Out[26]:
3.4012389779605696

Since the kurtosis (excess kurtosis, which is 0 for a normal distribution) is well above 0, the distribution is heavier-tailed than normal.

In [27]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['basement'], color='orange')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e6b675c0>

*basement: square footage of the basement, ranging from 0 to 4,820 sq ft. The plot shows that most houses have little or no basement area.

In [28]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['yr_built'], color='orange')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e7292cf8>

yr_built: year the house was built.

In [29]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['yr_renovated'], color='orange')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e73adfd0>
*yr_renovated: year in which the house was renovated.

*We can see a lot of zeros in this column; compared with the other attributes, zero here denotes a missing value.
In [30]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['zipcode'], color='orange')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e7417fd0>
In [31]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['lat'], color='orange')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e84ed6d8>
In [32]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['long'], color='orange')
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e8586e10>

*The longitude, latitude, and zipcode columns, when analyzed, do not have a big impact on the target column 'price'.

In [33]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['living_measure15'], color='orange')
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e860a3c8>

living_measure15: living-room area in 2015 (implying some renovation), which may or may not have affected the lot-size area. Here the plot is close to a normal distribution.

In [34]:
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['lot_measure15'], color='orange')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e87416a0>

lot_measure15: lot-size area in 2015 (implying some renovation).

In [35]:
plt.figure(figsize=(14, 5))
sns.distplot(inn_df['furnished'], color='orange')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e87cbbe0>

furnished: a categorical variable indicating whether the house is furnished ('0': not furnished, '1': furnished). We see many more houses unfurnished than furnished.

In [36]:
plt.figure(figsize=(6, 5))
sns.distplot(inn_df['total_area'], color='orange')
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e88add68>

Comparing lot_measure and total_area, we see the same distribution shape in the histograms.

In all the graphs, the PDF represents the distribution curve, similar to a histogram.

From the distribution plots we can see that more than 10 attributes are categorical, and many of them have a non-Gaussian distribution. The remaining, non-categorical attributes are mostly right-skewed. Since most attributes have non-Gaussian distributions, they contain outliers, which can be inspected visually using a boxplot.
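One quick way to confirm which numeric columns behave as categorical is a cardinality screen with `nunique()`. A sketch on a toy frame (column names assumed from the dataset; the values are synthetic):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the kinds of columns discussed above.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "coast":     rng.integers(0, 2, 100),          # binary flag
    "condition": rng.integers(1, 6, 100),          # 1-5 rating
    "price":     rng.lognormal(13, 0.5, 100),      # continuous target
})

# Columns stored as numbers but with few distinct values behave as
# categorical; a small-cardinality screen surfaces them.
cardinality = df.nunique()
categorical_like = cardinality[cardinality <= 13].index.tolist()
print(sorted(categorical_like))  # ['coast', 'condition']
```

The cut-off of 13 is an arbitrary choice here (matching the widest rating scale, quality); any threshold well below the row count works for this purpose.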

Boxplot

In [37]:
inn_df.boxplot(figsize = (20,5))
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e8977828>

Normalized (z-scored) values will give a clearer picture of the outliers in the boxplot

In [38]:
inn_df1=inn_df
from scipy.stats import zscore
inn_df1 =inn_df.apply(zscore)
inn_df1.head()
Out[38]:
price room_bed room_bath living_measure lot_measure ceil coast sight condition quality ... basement yr_built yr_renovated zipcode lat long living_measure15 lot_measure15 furnished total_area
0 0.729318 0.676485 1.474063 1.023606 -0.039835 -0.915427 -0.087173 -0.305759 2.444294 1.142667 ... -0.658681 -0.510853 -0.210128 1.029090 1.135587 -0.867059 0.194707 -0.191018 2.020944 -0.017069
1 -0.715066 0.676485 0.500221 0.511858 -0.183656 -0.915427 -0.087173 -0.305759 -0.629187 0.291916 ... 1.148964 0.170051 -0.210128 -1.026840 -1.757734 -1.222109 0.398975 -0.145346 -0.494818 -0.171608
2 -0.370711 -0.398737 0.500221 0.315869 -0.260335 0.936506 -0.087173 -0.305759 -0.629187 0.291916 ... -0.658681 1.191407 -0.210128 -0.746486 -1.505137 1.525981 0.559471 -0.308402 -0.494818 -0.252304
3 -0.653817 -1.473959 -1.447464 -1.371813 -0.271924 -0.915427 -0.087173 -0.305759 0.907554 -1.409587 ... -0.658681 -1.872660 -0.210128 1.029090 1.045374 -0.959372 -0.680725 -0.326861 -0.494818 -0.301116
4 0.432329 -1.473959 -0.798235 -0.740293 -0.266950 -0.915427 -0.087173 -0.305759 0.907554 0.291916 ... -0.658681 -0.578943 -0.210128 0.692665 0.842574 -0.391291 -0.126285 -0.250094 -0.494818 -0.282217

5 rows × 21 columns

In [39]:
inn_df1.boxplot(figsize = (20,5))
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x158e8e1c0f0>
  • From the boxplot we can clearly see that, except for a few attributes such as ceil, yr_built & zipcode, all the other attributes have outliers.
  • For better analysis of the dataset the outliers have to be treated, which can be done only once the number of outliers for each attribute is known.
In [40]:
Q1 = inn_df.quantile(0.25)
Q3 = inn_df.quantile(0.75)
In [41]:
IQR = Q3 - Q1
In [42]:
((inn_df < (Q1 - 1.5 * IQR)) | (inn_df > (Q3 + 1.5 * IQR))).sum()
Out[42]:
price               1159
room_bed             546
room_bath            571
living_measure       572
lot_measure         2425
ceil                   0
coast                163
sight               2124
condition             30
quality             1911
ceil_measure         611
basement             496
yr_built               0
yr_renovated         914
zipcode                0
lat                    2
long                 256
living_measure15     544
lot_measure15       2194
furnished           4251
total_area          2419
dtype: int64
  • This clearly shows which attributes have the most outliers.

BIVARIATE ANALYSIS

In [43]:
sns.pairplot(inn_df)
Out[43]:
<seaborn.axisgrid.PairGrid at 0x158e8e03da0>
In [44]:
correlation = inn_df.corr()
plt.figure(figsize=(20, 15))
sns.heatmap(correlation,annot=True, linewidth=0, vmin=-1)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x158f5349fd0>
  • From the heat map it is very clear that total_area and lot_measure have a correlation of '1', so either one of the columns can be dropped.
  • living_measure & ceil_measure have a correlation of '0.88', which means almost all the houses are built with a ceiling.
  • living_measure also has a good relationship with quality & room_bath, which means that with greater living space the quality is better and the number of bathrooms/bedrooms increases.
  • living_measure15 is just an extension of living_measure, so it correlates well with it.
  • All the furnished houses have better quality.

  • living_measure, quality, ceil_measure, furnished & room_bath have the strongest correlation with the target variable ('price'). So these attributes play a major role in predicting the price of a property in this dataset.
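The price correlations called out above can be extracted programmatically instead of read off the heatmap. A minimal sketch on synthetic data (the column names and the price relationship are assumed for illustration, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 500
living = rng.normal(2000, 500, n)
quality = rng.integers(3, 12, n).astype(float)
# price driven mainly by living_measure, as the heatmap suggests
price = 300 * living + 20000 * quality + rng.normal(0, 50000, n)
demo = pd.DataFrame({'price': price,
                     'living_measure': living,
                     'quality': quality,
                     'zipcode': rng.integers(98001, 98200, n).astype(float)})

# absolute correlation of every attribute with the target, strongest first
price_corr = demo.corr()['price'].drop('price').abs().sort_values(ascending=False)
print(price_corr)
```

On the real dataset, `demo` would simply be `inn_df`, and the sorted series reproduces the bullet list above.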

3. Deciding Models and Model Building

In [38]:
y = inn_df.iloc[:,0]
In [39]:
inn = inn_df.iloc[:,1:]
In [40]:
from sklearn import model_selection
test_size = 0.30 # taking 70:30 training and test set
seed = 7  # random number seeding for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn, y, test_size=test_size, random_state=seed)
In [125]:
#Testing using Simple linear model 
In [126]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

regression_model.coef_
Out[126]:
array([-3.42965531e+04,  4.44280247e+04,  8.02199084e+01, -4.00647941e+01,
        6.85480021e+03,  4.96172063e+05,  5.43116427e+04,  2.91887065e+04,
        8.45032653e+04,  5.78472992e+01,  2.23726095e+01, -2.58062506e+03,
        2.21119213e+01, -5.90281160e+02,  6.16900000e+05, -2.12248589e+05,
        2.01079557e+01, -3.71107304e-01,  4.48608648e+04,  4.01551124e+01])
In [127]:
regression_model.intercept_
Out[127]:
7081869.056403663
In [128]:
regression_model.score(X_train, y_train)
Out[128]:
0.6993391229680337
In [129]:
regression_model.score(X_test, y_test)
Out[129]:
0.7014013348373487
  • With the existing dataset the model achieves a score of about 70% on the test data
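The R² score alone hides the scale of the prediction errors; a hedged sketch (on synthetic data, not the housing dataset) of reporting RMSE alongside the score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
lr = LinearRegression().fit(X_tr, y_tr)
r2 = lr.score(X_te, y_te)                                   # same as regression_model.score above
rmse = np.sqrt(mean_squared_error(y_te, lr.predict(X_te)))  # error in the target's own units
print(round(r2, 3), round(rmse, 3))
```

With the real data, the RMSE would be in dollars, which is easier to communicate to a buyer or seller than an R² of 0.70.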
In [16]:
# Testing using Ridge & Lasso method
In [130]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
In [131]:
ridge = Ridge(alpha=.3)
ridge.fit(X_train, y_train)
print ("Ridge model:", (ridge.coef_))
Ridge model: [-3.43066348e+04  4.44257778e+04  8.00413819e+01 -4.03387626e+01
  6.88814081e+03  4.94611790e+05  5.43742439e+04  2.91787640e+04
  8.45401407e+04  5.77376616e+01  2.23052819e+01 -2.58200692e+03
  2.21186209e+01 -5.89398004e+02  6.16109129e+05 -2.11821861e+05
  2.00981371e+01 -3.71414138e-01  4.48325421e+04  4.04288621e+01]
In [132]:
lasso = Lasso(alpha=0.2)
lasso.fit(X_train, y_train)
print ("Lasso model:", (lasso.coef_))
Lasso model: [-3.42964013e+04  4.44271587e+04  2.64923928e+02 -3.05002031e-01
  6.85460560e+03  4.96142059e+05  5.43127257e+04  2.91879288e+04
  8.45046602e+04 -8.70965992e+01 -1.22570597e+02 -2.58065467e+03
  2.21120919e+01 -5.90252406e+02  6.16886632e+05 -2.12230353e+05
  2.01075209e+01 -3.71119789e-01  4.48574988e+04  3.95314371e-01]
C:\Users\Gowtham\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
In [133]:
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
0.6993389244543258
0.7013657486911482
In [134]:
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
0.699339122871955
0.7014006117815433

Model score obtained is 70% for both ridge & lasso methods
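Rather than fixing alpha = 0.3 and alpha = 0.2 by hand as above, the regularization strength can be selected by cross-validation. A sketch on synthetic data using scikit-learn's RidgeCV and LassoCV:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 3.0]) + rng.normal(0, 0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
# RidgeCV picks alpha from the given grid; LassoCV builds its own alpha path
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, y_tr)
lasso_cv = LassoCV(cv=5, random_state=7).fit(X_tr, y_tr)
print(ridge_cv.alpha_, lasso_cv.alpha_)
print(round(ridge_cv.score(X_te, y_te), 3), round(lasso_cv.score(X_te, y_te), 3))
```

On the housing data this would also sidestep the ConvergenceWarning seen above, since LassoCV tests a whole path of alphas instead of one very small value.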

In [135]:
# Testing using Decision Tree Regressor
In [136]:
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeRegressor

d1 = DecisionTreeRegressor(max_depth = 10)
d1.fit(X_train, y_train)
Out[136]:
DecisionTreeRegressor(criterion='mse', max_depth=10, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
In [137]:
d1.score(X_train, y_train)
Out[137]:
0.916127598460612
In [138]:
d1.score(X_test, y_test)
Out[138]:
0.7639751713858369
  • Model score obtained is 76% on the test data (92% on the training data, which indicates some overfitting)
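The gap between the 92% training score and the 76% test score is the usual symptom of an overgrown tree. A small sketch (synthetic data, not the housing dataset) showing how the train/test gap widens as max_depth grows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(600, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
for depth in (2, 5, 10, 20):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=7).fit(X_tr, y_tr)
    # train score keeps climbing with depth while the test score stalls
    print(depth, round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))
```

A depth chosen from such a sweep (or via cross-validation) would be a more defensible setting than the max_depth = 10 fixed above.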
In [139]:
# Testing using Gradient Boosting Method
In [157]:
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor()
dt_gb = dt_gb.fit(X_train, y_train)
In [158]:
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
Out[158]:
0.8716108082683545
In [142]:
# Testing using Random Forest Regressor
In [147]:
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor()
dt_rf = dt_rf.fit(X_train, y_train)
In [148]:
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
Out[148]:
0.8695394595852415

From all the model scores we found that the Random Forest Regressor & Gradient Boosting Regressor have the highest model score of about 87% on the existing dataset.
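A single 70:30 split can flatter or penalize a model by chance; k-fold cross-validation gives a more robust comparison before committing to GB or RF. A sketch using make_regression as a stand-in for the housing data:

```python
import numpy as np
from sklearn.datasets import make_regression  # stand-in for the housing data
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=7)
for name, model in [('GB', GradientBoostingRegressor(random_state=7)),
                    ('RF', RandomForestRegressor(n_estimators=100, random_state=7))]:
    # mean +/- spread of R2 over 5 folds, instead of one lucky split
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(name, round(scores.mean(), 3), '+/-', round(scores.std(), 3))
```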

  • Having chosen the models to be used for analyzing our dataset, we can now fine-tune them through different iterations to increase the model score.

The different iterations through which we try to improve our model score are:

  • Capping the outliers at (mean + 3*SD) on the existing dataset and developing a model using GB and RF.
  • Replacing 'zeros' in the attributes that provide wrong information with their respective column medians and developing a model using GB and RF.
  • Using PCA to find the minimum number of attributes needed to develop a better model using GB or RF while retaining at least 95% of the variance.
  • Analysing the model performance with GB and RF when attributes are removed.
  • Applying a polynomial function to the existing dataset and testing the model score using GB & RF.
  • Changing the learning rate and the number of estimators in GB & RF.

Iteration 1

  • Capping the outliers at (mean + 3*SD) on the existing dataset and developing a model using GB and RF.
In [58]:
a = np.mean(inn_df, axis=0)
b = np.std(inn_df, axis=0)
d = np.median(inn_df, axis=0)
c = (a+(3*b))
In [59]:
d
Out[59]:
array([ 4.50000e+05,  3.00000e+00,  2.25000e+00,  1.91000e+03,
        7.61800e+03,  1.50000e+00,  0.00000e+00,  0.00000e+00,
        3.00000e+00,  7.00000e+00,  1.56000e+03,  0.00000e+00,
        1.97500e+03,  0.00000e+00,  9.80650e+04,  4.75718e+01,
       -1.22230e+02,  1.84000e+03,  7.62000e+03,  0.00000e+00,
        9.57500e+03])
In [98]:
inn_new = np.where(inn_df > c, c, inn_df)
In [99]:
inn_new1 = pd.DataFrame(inn_new)
In [100]:
inn_new1.boxplot(figsize = (25,5))
Out[100]:
<matplotlib.axes._subplots.AxesSubplot at 0x2148de149b0>
In [101]:
inn_new2 =inn_new1.apply(zscore)
inn_new2.head()
Out[101]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 0.933625 0.712536 1.511426 1.094771 0.022413 -0.915976 -0.087173 -0.319192 2.444294 1.161933 ... -0.672881 -0.510853 -0.210135 1.029090 1.135587 -0.878686 0.207486 -0.247161 2.020944 0.066446
1 -0.836516 0.712536 0.516518 0.553969 -0.259460 -0.915976 -0.087173 -0.319192 -0.629187 0.299722 ... 1.197967 0.170051 -0.210135 -1.026840 -1.757734 -1.240516 0.417335 -0.163213 -0.494818 -0.234601
2 -0.414498 -0.410405 0.516518 0.346853 -0.409742 0.937603 -0.087173 -0.319192 -0.629187 0.299722 ... -0.672881 1.191407 -0.210135 -0.746486 -1.505137 1.560043 0.582216 -0.462923 -0.494818 -0.391799
3 -0.761454 -1.533346 -1.473297 -1.436643 -0.432455 -0.915976 -0.087173 -0.319192 0.907554 -1.424698 ... -0.672881 -1.872660 -0.210135 1.029090 1.045374 -0.972762 -0.691866 -0.496852 -0.494818 -0.486886
4 0.569655 -1.533346 -0.810025 -0.769270 -0.422707 -0.915976 -0.087173 -0.319192 0.907554 0.299722 ... -0.672881 -0.578943 -0.210135 0.692665 0.842574 -0.393835 -0.122276 -0.355749 -0.494818 -0.450069

5 rows × 21 columns

In [102]:
inn_new2.boxplot(figsize = (25,5))
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x21496ca5cc0>

From the plot it is clear that the number of outliers in each attribute has been reduced. To get a clearer view, let us find the count of outliers for each attribute.

In [81]:
Q1 = inn_new1.quantile(0.25)
Q3 = inn_new1.quantile(0.75)
In [82]:
IQR = Q3 - Q1
In [83]:
((inn_new1 < (Q1 - 1.5 * IQR)) | (inn_new1 > (Q3 + 1.5 * IQR))).sum()
Out[83]:
0      870
1      484
2      384
3      406
4     2205
5        0
6        0
7     1295
8       30
9     1808
10     453
11     425
12       0
13       0
14       0
15       2
16      40
17     382
18    1888
19    4251
20    2197
dtype: int64

This clearly shows that the number of outliers in each attribute is smaller compared to the original dataset.

In [84]:
# Let us now test the obtained dataset using GB and RF to find whether this has any effect on the performance of the model
In [213]:
y1 = inn_new1.iloc[:,0]
In [236]:
inn1 = inn_new1.iloc[:,1:]
In [215]:
test_size = 0.30 # taking 70:30 training and test set
seed = 7  # random number seeding for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn1, y1, test_size=test_size, random_state=seed)
In [216]:
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor()
dt_gb = dt_gb.fit(X_train, y_train)
In [217]:
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
Out[217]:
0.8800606837375257
In [218]:
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor()
dt_rf = dt_rf.fit(X_train, y_train)
In [219]:
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
Out[219]:
0.8723216762326628

Capping the outliers has increased the model score by about 1% on GB and 0.3% on RF. So we can proceed further with our analysis on the outlier-capped dataset.
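The mean + 3*SD capping used in this iteration can be wrapped in a small reusable helper; a sketch on demo data (the tiny frame and the k=1 setting are chosen only so the cap visibly bites on such a small sample):

```python
import pandas as pd

def cap_upper_outliers(df, k=3.0):
    """Cap each column at mean + k*std, mirroring np.where(df > c, c, df) above."""
    upper = df.mean() + k * df.std(ddof=0)  # ddof=0 matches np.std
    return df.clip(upper=upper, axis=1)     # align thresholds column-wise

demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 100.0],
                     'b': [10.0, 11.0, 12.0, 13.0]})
capped = cap_upper_outliers(demo, k=1.0)
print(capped['a'].max(), capped['b'].max())
```

With the real dataset, `cap_upper_outliers(inn_df)` (k=3 default) reproduces the `inn_new` cell while keeping the DataFrame's column names, unlike the `np.where` version.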

Iteration 2

Replacing the zeros of room_bath with the respective column median

In [ ]:
median_room_bath = inn1['room_bath'].median(skipna=True)
In [229]:
inn3=inn1.replace({'room_bath': {0: median_room_bath}})
In [230]:
# Let us now test the obtained dataset using GB and RF to find whether this has any effect on the performance of the model
In [231]:
test_size = 0.30 # taking 70:30 training and test set
seed = 7  # random number seeding for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn3, y1, test_size=test_size, random_state=seed)
In [232]:
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor()
dt_gb = dt_gb.fit(X_train, y_train)
In [233]:
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
Out[233]:
0.8800727648391311
In [234]:
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor()
dt_rf = dt_rf.fit(X_train, y_train)
In [235]:
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
Out[235]:
0.8720671317877429

Replacing the missing (zero) values of room_bath with the column median does not noticeably change the model score, so we can leave this iteration aside and continue with the dataset from the previous iteration.
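For reference, the zero-to-median replacement can be sketched on a synthetic column (computing the median over the non-zero rows only, a slight refinement of the cell above; the values are invented for the demo):

```python
import pandas as pd

# synthetic column with zeros standing in for missing room_bath values
demo = pd.Series([0.0, 1.0, 2.0, 2.0, 3.0], name='room_bath')
median_nonzero = demo[demo != 0].median()  # median over the non-zero rows
filled = demo.replace(0, median_nonzero)
print(filled.tolist())
```

Excluding the zeros before taking the median avoids the zeros themselves dragging the replacement value down.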

Iteration 3

Tuning the learning rate and the estimators in GB & RF

In [271]:
test_size = 0.30 # taking 70:30 training and test set
seed = 7  # random number seeding for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn1, y1, test_size=test_size, random_state=seed)
In [284]:
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor(n_estimators = 290, learning_rate=0.22)
dt_gb = dt_gb.fit(X_train, y_train)
In [285]:
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
Out[285]:
0.8994773479790737
In [278]:
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor(n_estimators = 290, max_depth = 100)
dt_rf = dt_rf.fit(X_train, y_train)
In [279]:
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
Out[279]:
0.8862534461873949

After a number of iterations and tests we found that n_estimators = 290 with a learning rate of 0.22 in GB gave us the best model score of about 89.9%.

RF did not perform as well as GB, so we are going to use GB in the further iterations and check whether our model can be tuned better.
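The manual sweep over estimators and learning rate can be automated with GridSearchCV; a sketch on make_regression stand-in data (the grid values echo the ones tried above, but the data is synthetic):

```python
from sklearn.datasets import make_regression  # stand-in for the housing data
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=7)
param_grid = {'n_estimators': [100, 200, 290], 'learning_rate': [0.1, 0.22]}
# exhaustive search over the grid, scored by cross-validated R2
search = GridSearchCV(GradientBoostingRegressor(random_state=7),
                      param_grid, cv=3, scoring='r2')
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the score is cross-validated, the chosen pair is less likely to be an artifact of the single 70:30 split used in the manual search.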

Iteration 4

Using PCA to find the optimal no of attributes to get 95% of variance

In [286]:
inn2=inn1.apply(zscore)
In [287]:
cov_matrix = np.cov(inn2.T)
In [288]:
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
  • Finding the variance and cumulative variance explained by each eigenvector
In [289]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Cumulative Variance Explained [ 30.0405518   43.79393816  54.56148891  61.54709114  67.4556658
  72.48378316  76.99216907  81.1625612   84.93368109  87.97565566
  90.84856312  92.83916736  94.58472009  96.13622754  97.36537259
  98.35835967  99.22640057  99.98395086  99.99981274 100.        ]
In [290]:
plt.figure(figsize=(5,5))
plt.plot(var_exp)
Out[290]:
[<matplotlib.lines.Line2D at 0x2148e65e828>]
In [291]:
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
In [292]:
# Sort eigenvalues in descending order

# Make a set of (eigenvalue, eigenvector) pairs
eig_pairs = [(eig_vals[index], eig_vecs[:,index]) for index in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) pairs from highest to lowest with respect to eigenvalue
eig_pairs.sort()

eig_pairs.reverse()
print(eig_pairs)

# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eig_vals))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eig_vals))]
[(6.0083883590653535, array([-0.22175415, -0.32102481, -0.36450696, -0.16933753, -0.204113  ,
       -0.04026399, -0.10208238,  0.07413573, -0.34439645, -0.36452029,
       -0.06464835, -0.22357132, -0.00297266,  0.15075977,  0.00579201,
       -0.18653255, -0.33472103, -0.16919704, -0.29969769, -0.18256539])), (2.750804546645037, array([ 0.11852041,  0.17296374,  0.10380281, -0.51378907,  0.20943408,
        0.00356859,  0.04147143, -0.07194686,  0.15132351,  0.08202085,
        0.0630939 ,  0.07499199,  0.01348288,  0.14668214,  0.14768553,
       -0.20079847,  0.04691874, -0.48991696,  0.10557019, -0.50406967])), (2.1536097948646473, array([-0.13366799, -0.02122324, -0.17532317, -0.08491258,  0.24782194,
       -0.19979304, -0.35309237, -0.26633783, -0.04679948,  0.05636046,
       -0.46484128,  0.39271298, -0.18938501, -0.28528238, -0.18647433,
        0.30757283, -0.08253031, -0.07781868, -0.05692421, -0.09159505])), (1.3971850924450782, array([ 0.34448907,  0.07785251,  0.10949759, -0.14279249, -0.31448443,
       -0.20705949, -0.13741626,  0.39528668, -0.10474005, -0.05913198,
        0.33450434, -0.00655818, -0.20330842, -0.41627624, -0.27411706,
        0.23026074,  0.05458249, -0.11515278, -0.14780003, -0.13678481])), (1.181769609413806, array([-0.13028897, -0.02449238, -0.06278552, -0.10704189, -0.01309136,
        0.63845999,  0.44688981, -0.0214698 , -0.04869689, -0.02657701,
       -0.07362596,  0.07959112,  0.16694368, -0.24453574, -0.47329761,
        0.11469115,  0.01575026, -0.07360026, -0.04563514, -0.10787687])), (1.0056700035561918, array([ 0.21243863,  0.09417705,  0.04697008, -0.00444494,  0.03291656,
       -0.17070044, -0.20840528, -0.21572349, -0.10329178,  0.04451797,
        0.01426039, -0.14695554,  0.85928004, -0.09510359, -0.12426291,
        0.0630887 , -0.07064067, -0.0193775 , -0.13012535, -0.00248615])), (0.9017189018987058, array([-0.23222308, -0.23695129, -0.03555688, -0.10648326, -0.08281781,
        0.02036905, -0.05221371,  0.44346908,  0.13477481,  0.14721057,
       -0.35277146, -0.31480745,  0.2091842 , -0.25423056,  0.32211277,
        0.24797516,  0.19710041, -0.05277422,  0.28387752, -0.10682917])), (0.834117020379768, array([-0.17029508, -0.04744729, -0.02338123, -0.05608603, -0.32423727,
        0.08198099,  0.07542789, -0.49172866,  0.00116461, -0.20436418,
        0.33646734,  0.14218952,  0.0270793 , -0.22565672,  0.49380212,
        0.34244642,  0.11442496, -0.0277171 , -0.02730416, -0.05615617])), (0.754258876436883, array([-0.30391644, -0.17778067, -0.00137595, -0.04146383, -0.25525251,
       -0.54280279,  0.27257543, -0.21567684,  0.17966568, -0.03558228,
        0.05842438, -0.04704046,  0.03734877,  0.01955038, -0.4302648 ,
       -0.12913005,  0.16053917, -0.03845086,  0.34929505, -0.04117279])), (0.6084230648081259, array([-0.30388831,  0.26281984, -0.08628362,  0.03500147,  0.36971978,
       -0.34682039,  0.43545523,  0.3578061 , -0.01919084, -0.16885334,
        0.14381322,  0.24739698,  0.17178419, -0.0951741 ,  0.1648205 ,
        0.11547611, -0.14492371,  0.01430181, -0.20803857,  0.030999  ])), (0.5746080779569309, array([-0.42140241,  0.14563032, -0.00200837,  0.02031558, -0.01656609,
        0.23377081, -0.53708679,  0.19500073,  0.2321524 , -0.17933748,
        0.33180821,  0.23651338,  0.14640587, -0.06514745, -0.12284251,
       -0.20106877, -0.12474718,  0.02309618,  0.26381633,  0.02000604])), (0.39813926820307877, array([-0.16138963,  0.04871734,  0.09188137, -0.08155967, -0.13220729,
        0.02458746, -0.10037122,  0.13917324, -0.09533044,  0.07294074,
        0.04700746,  0.11976763,  0.06717946,  0.69257535, -0.1639762 ,
        0.57442956,  0.16050211,  0.00874182, -0.07184471, -0.07724316])), (0.34912669970113347, array([ 0.43922079, -0.05431585, -0.24108209,  0.02092752, -0.18190831,
        0.00253464,  0.15182253,  0.09544032,  0.15859393, -0.20674931,
       -0.1097531 ,  0.28029334,  0.10073916,  0.1279599 ,  0.06406849,
        0.18183123, -0.48336951,  0.00943434,  0.47358692,  0.01018818])), (0.3103158492686729, array([-0.0294126 , -0.15119516,  0.07505979,  0.01596537,  0.51438753,
        0.02110628, -0.03509814, -0.1441186 , -0.07038994, -0.08431395,
        0.33131779, -0.48699593, -0.16107261, -0.03109547, -0.11629185,
        0.35305941, -0.27471577, -0.0693644 ,  0.28041267,  0.01861215])), (0.24584038460783067, array([-0.25649273,  0.51052317,  0.21022503,  0.04943019, -0.35122938,
       -0.01167807,  0.05820307, -0.08551149, -0.03617209,  0.31496896,
       -0.16377424, -0.23089251, -0.1076859 , -0.04576251,  0.02565016,
        0.05303576, -0.5112635 , -0.17395617,  0.0051588 ,  0.0570551 ])), (0.19860660412680026, array([-0.06315048, -0.6166685 ,  0.41292963,  0.04656696, -0.00407053,
       -0.01192499,  0.04070581,  0.06694469,  0.05364796,  0.37823044,
        0.14651514,  0.30897244,  0.07484666, -0.02029956,  0.04926851,
       -0.03907842, -0.32595706, -0.16112636, -0.16673139,  0.06253476])), (0.17361621313396056, array([ 0.04402645,  0.00719144, -0.13537725,  0.37495624, -0.00189891,
        0.02755598, -0.02602039,  0.02744303,  0.13666297, -0.13130568,
       -0.0281021 , -0.01060318,  0.01246408,  0.04383814, -0.0249848 ,
        0.05719862,  0.18858232, -0.78848674, -0.05962722,  0.36506774])), (0.15151706795234635, array([-0.03701369,  0.04007148,  0.09241617,  0.06054388,  0.01470157,
       -0.00202435,  0.01683456,  0.04813653, -0.81579634,  0.08678039,
        0.02859153,  0.21107805,  0.03982159, -0.03963646,  0.07472125,
       -0.12504032,  0.13553781, -0.13251256,  0.44638411,  0.06330294])), (0.0031725227998885184, array([ 1.89004504e-04, -2.33125026e-03,  6.96023603e-01,  1.35426253e-02,
        5.67742795e-03,  1.72943771e-03,  8.40996230e-04, -1.18462911e-03,
       -9.85781956e-04, -6.34694352e-01, -3.35103913e-01, -1.12671855e-03,
       -1.75524976e-04, -1.54805733e-03, -2.56600549e-04,  1.22076735e-03,
       -2.07808096e-03, -2.26608160e-03, -9.85640360e-05, -1.33302381e-02])), (3.74545440250138e-05, array([ 5.27159731e-05, -1.55587284e-04, -1.72200676e-03, -7.02871835e-01,
        2.29148641e-04, -3.30184479e-04, -1.69614197e-04,  7.04757105e-05,
       -1.85951549e-04, -2.47787762e-02, -1.33146979e-02,  1.42549204e-04,
       -7.72205087e-05,  1.95174299e-04,  3.13456134e-05,  2.67048481e-04,
       -5.79619283e-04, -4.37580563e-04,  3.04581199e-04,  7.10757297e-01]))]
In [293]:
P_reduce = np.array(eigvectors_sorted[0:17])   # Reducing from 20 to 17 principal-component dimensions

inn_4D = np.dot(inn2,P_reduce.T)   # projecting original data into principal component dimensions

inn_data_df = pd.DataFrame(inn_4D)  # converting array to dataframe for pairplot
In [294]:
test_size = 0.30 # taking 70:30 training and test set
seed = 7  # random number seeding for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn_data_df, y1, test_size=test_size, random_state=seed)
In [295]:
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor(n_estimators = 290, learning_rate=0.22)
dt_gb = dt_gb.fit(X_train, y_train)
In [296]:
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
Out[296]:
0.8195331831890826

From the PCA we can see that around 17 to 18 components are needed to retain 95% of the variance; when tested with the model, they give a score of 82%.

The elbow curve shows no clear elbow; elbow analysis is normally used to find the number of clusters in a dataset, which is not needed here as the target column is continuous.
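The manual eigen-decomposition above can also be done with scikit-learn's PCA, which selects the number of components for a target variance directly. A sketch on synthetic correlated features standing in for the z-scored housing attributes:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# 10 features built from 4 latent factors, plus 2 independent features
latent = rng.normal(size=(500, 4))
X = np.hstack([latent @ rng.normal(size=(4, 10)), rng.normal(size=(500, 2))])
Xz = StandardScaler().fit_transform(X)  # z-scoring, as done with zscore above

pca = PCA(n_components=0.95)  # keep just enough components for 95% variance
Xr = pca.fit_transform(Xz)
print(Xr.shape[1], round(pca.explained_variance_ratio_.sum(), 3))
```

On the real dataset, `PCA(n_components=0.95).fit_transform(inn2)` would return the 17-odd components found manually, without sorting eigenpairs by hand.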

Iteration 5

In [351]:
inn_df = pd.read_csv("innercity.csv") 
In [352]:
inn_df = inn_df.drop(['cid'], axis=1)
In [353]:
inn_df = inn_df.drop(['dayhours'], axis=1)
In [354]:
#inn_df = inn_df.drop(['room_bed'], axis=1)
In [355]:
#inn_df = inn_df.drop(['room_bath'], axis=1)
In [356]:
#inn_df = inn_df.drop(['living_measure'], axis=1)
In [357]:
inn_df = inn_df.drop(['lot_measure'], axis=1)
In [358]:
#inn_df = inn_df.drop(['ceil'], axis=1)
In [359]:
#inn_df = inn_df.drop(['coast'], axis=1)
In [360]:
#inn_df = inn_df.drop(['sight'], axis=1)
In [361]:
#inn_df = inn_df.drop(['condition'], axis=1)
In [362]:
#inn_df = inn_df.drop(['quality'], axis=1)
In [363]:
#inn_df = inn_df.drop(['ceil_measure'], axis=1)
In [364]:
inn_df = inn_df.drop(['basement'], axis=1)
In [365]:
#inn_df = inn_df.drop(['yr_built'], axis=1)
In [366]:
#inn_df = inn_df.drop(['yr_renovated'], axis=1)
In [367]:
#inn_df = inn_df.drop(['zipcode'], axis=1)
In [368]:
#inn_df = inn_df.drop(['lat'], axis=1)
In [369]:
#inn_df = inn_df.drop(['long'], axis=1)
In [370]:
#inn_df = inn_df.drop(['living_measure15'], axis=1)
In [371]:
#inn_df = inn_df.drop(['lot_measure15'], axis=1)
In [372]:
inn_df = inn_df.drop(['furnished'], axis=1)
In [373]:
#inn_df = inn_df.drop(['total_area'], axis=1)
In [374]:
a = np.mean(inn_df, axis=0)
b = np.std(inn_df, axis=0)
c = (a+(3*b))
In [375]:
inn_new = np.where(inn_df > c, c, inn_df)
In [376]:
inn_new1 = pd.DataFrame(inn_new)
In [377]:
y2 = inn_new1.iloc[:,0]
In [378]:
inn2 = inn_new1.iloc[:,1:]
In [379]:
test_size = 0.30 # taking 70:30 training and test set
seed = 7  # random number seeding for repeatability of the code
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn2, y2, test_size=test_size, random_state=seed)
In [380]:
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor(n_estimators = 290, learning_rate=0.22)
dt_gb = dt_gb.fit(X_train, y_train)
In [381]:
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
Out[381]:
0.899891188041953

Removing other columns, along with dayhours and cid, to find out which columns help improve the model performance. The model scores after dropping each attribute individually were:

  • room_bed - The model score was 89.896%
  • room_bath - The model score was 89.73%
  • living_measure - The model score was 89.744%
  • lot_measure - The model score was 89.869%
  • ceil - The model score was 89.755%
  • coast - The model score was 89.23%
  • sight - The model score was 89.47%
  • condition - The model score was 89.45%
  • quality - The model score was 89.298%
  • ceil_measure - The model score was 89.665%
  • basement - The model score was 89.938%
  • yr_built - The model score was 89.576%
  • yr_renovated - The model score was 89.635%
  • zipcode - The model score was 89.6%
  • lat - The model score was 88.588%
  • long - The model score was 89.30%
  • living_measure15 - The model score was 89.546%
  • lot_measure15 - The model score was 89.839%
  • furnished - The model score was 89.918%
  • total_area - The model score was 89.672%

From the results we can clearly see that the model score decreases if we remove any one attribute on its own. Let us check whether the model score increases when two or more attributes are removed simultaneously, apart from cid & dayhours.

Removing two and greater than two attributes

  • lot_measure & basement - The model score was 89.983%

  • lot_measure, lot_measure15 & basement - The model score was 89.936%

  • lot_measure, furnished & basement - The model score was 89.989%

After all the attribute-removal analysis, removing lot_measure, furnished & basement yields the best score of 89.989%. Thus the proposed model can perform with an accuracy of about 90% using only 17 of the 22 attributes in the given dataset, which matches what we found from PCA as the minimum number of attributes needed to retain 95% of the variance.
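As a complement to the drop-one-attribute study, tree ensembles expose feature_importances_, which rank attributes directly from one fit. A sketch on make_regression stand-in data (the attribute names are invented for illustration):

```python
import pandas as pd
from sklearn.datasets import make_regression  # stand-in for the 17 kept attributes
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=5, n_informative=2,
                       noise=10.0, random_state=7)
gb = GradientBoostingRegressor(random_state=7).fit(X, y)
# importances sum to 1; the informative features dominate the ranking
imp = pd.Series(gb.feature_importances_,
                index=[f'attr_{i}' for i in range(5)]).sort_values(ascending=False)
print(imp)
```

On the real dataset, `pd.Series(dt_gb.feature_importances_, index=X_train.columns)` would likely confirm living_measure, quality and ceil_measure at the top, consistent with the correlation analysis.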

In [ ]: